HLearn: A Machine Learning Library for Haskell
Abstract
HLearn is a Haskell-based library for machine learning. Its distinguishing feature is that it exploits the algebraic properties of learning models. Every model in the library is an instance of the HomTrainer type class, which ensures that the batch trainer is a monoid homomorphism. This is a restrictive condition that not all learning models satisfy; however, it is useful for two reasons. First, this property lets us easily derive three important functions for machine learning algorithms: online trainers, parallel trainers, and fast cross-validation algorithms. Second, many popular algorithms (or variants on them) satisfy the condition and are implemented in the library. For example, the HLearn library implements the standard version of many distribution estimators and Bayesian classification, as well as homomorphic variants of perceptrons, kd-trees, decision trees, ensemble algorithms, and clustering algorithms. Furthermore, many of these learning models have additional algebraic structure that the HLearn library exploits. In particular, if a model has Abelian group structure, then we can perform more efficient cross-validation; and if it has R-module structure, then we can use weighted data points. Hopefully, this algebraic framework makes it easier to incorporate machine learning into the average Haskell application.

1 Why another library for machine learning?

Machine learning libraries need to be fast. In order to get this speed, the most popular libraries are written in low-level languages (see Table 1). Unfortunately, this emphasis on speed has meant that the libraries are often inconvenient to use. Current practice is to provide bindings to these low-level libraries in higher-level languages (e.g. R, Matlab, Python, and even Haskell), but this still leaves much to be desired. These interfaces are not standardized and require specialist knowledge to understand and use. In practice, they are rarely used by the average programmer writing an average program. The machine learning community is well aware of this deficiency, although there is relatively little effort to fix it [20].

Table 1. The most popular machine learning libraries are written in non-functional languages. Weka is the most fully featured of these packages, and it is the easiest for a novice to use. It is no coincidence that it is written in the highest-level language.

    Library                                        Language
    C4.5 Decision Trees                            C
    Fast Artificial Neural Networks (FANN)         C
    Stuttgart Neural Network Simulator (SNNS)      C
    Support Vector Machines Light (SVMLight)       C
    Library for Support Vector Machines (LibSVM)   C++ and Java
    Open Computer Vision (OpenCV)                  C++
    Weka                                           Java

The goal of the HLearn library¹ is to change this status quo and make machine learning techniques easily usable by non-specialists. We don't claim to have solved this problem, only that we are aiming in that direction. We do this by characterizing learning models according to their algebraic structure. This is a powerful design pattern commonly used in functional programming libraries [21]. The pattern makes libraries easy to build and maintain by reducing the amount of boilerplate code. More importantly, it makes libraries easy to use: once a user understands an algebraic structure, she automatically understands all instances of that structure.

¹ The H in HLearn stands for both Haskell (the language the library is written in) and homomorphism (because all batch trainers in the library must be homomorphisms; see Section 2).

For example, the normal distribution forms a vector space, but Markov chains only form a monoid. As we shall see later, this means that both models can use the same functions for online and parallel training, but that normal distributions additionally support more efficient cross-validation and the weighting of data points. The user does not have to know anything about how these models actually work, just what algebraic structure they have.
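To make the first case concrete, consider a minimal sketch. The NormalStats type and its helpers below are our own illustration rather than HLearn's actual representation: a normal-distribution estimator reduced to its sufficient statistics forms a monoid, so models trained on separate chunks of data (possibly on separate processors) merge into the model of the combined data.

    -- Illustrative sketch only; not the library's actual code.
    -- A normal distribution estimated via its sufficient statistics.
    data NormalStats = NormalStats
        { n   :: !Double  -- number of data points
        , sx  :: !Double  -- sum of the points
        , sxx :: !Double  -- sum of the squared points
        } deriving (Show)

    -- Merging two estimates just adds the statistics componentwise.
    instance Semigroup NormalStats where
        NormalStats n1 s1 q1 <> NormalStats n2 s2 q2 =
            NormalStats (n1 + n2) (s1 + s2) (q1 + q2)

    instance Monoid NormalStats where
        mempty = NormalStats 0 0 0

    -- The model of a one-element data set.
    fromPoint :: Double -> NormalStats
    fromPoint x = NormalStats 1 x (x * x)

    mean, variance :: NormalStats -> Double
    mean     st = sx st / n st
    variance st = sxx st / n st - mean st * mean st

    -- Batch training is then just foldMap fromPoint, and training two
    -- halves of a data set separately before merging yields (up to
    -- floating-point rounding) the same model as training all at once:
    --   ghci> mean (foldMap fromPoint [1, 2, 3, 4])
    --   2.5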
Of course, there have been many other attempts to write machine learning algorithms in a functional language. The fact that probability distributions have a monad structure [9, 12, 18] has formed the foundation for a number of libraries for probabilistic programming [2, 7, 15]. The HLearn library is orthogonal to this work, and in principle both designs could exist side by side. Other attempts to integrate functional programming with machine learning have not used algebra [3, 14, 1]. Instead, they use the power of type classes, higher-order functions, and pattern matching to express learning algorithms in a functional setting. The HLearn library incorporates many of these ideas.

Finally, we note that all of this previous work has taken place within the Haskell language. Haskell is a good choice for this type of experimentation because it is fast and was designed from the beginning to incorporate experimental language features [11, 17]. HLearn relies on a number of recent language extensions implemented in the Glasgow Haskell Compiler (GHC), such as TypeFamilies, GADTs, DataKinds, ConstraintKinds, and TemplateHaskell.

Besides the theoretical advantages of the HLearn library, there is also a practical one: a standardized interface for many learning tasks. While there are a number of excellent Haskell packages for machine learning,² each of these packages unfortunately uses a different interface. This makes it difficult to compose learning routines, ruining one of the main advantages of functional programming. For example, the statistics package assumes that all input data points are stored inside unboxed vectors, whereas the KdTree package requires data points stored in lists. The HLearn library has no such requirements: we can store our data points however is most convenient for our particular application. We only require that the container be a foldable functor. One neat trick this lets us do is work with data sets larger than memory, by using containers that seamlessly swap to and from disk.

² There are too many packages to describe them all here. For a complete list, visit the Hackage repository (http://hackage.haskell.org) and look under the sections statistics, artificial intelligence, machine learning, and data mining.
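The foldable-functor requirement can be made concrete with a small self-contained sketch. The MeanModel type and trainMean function below are our own illustration, not part of HLearn's API; the point is that a trainer written against Foldable alone never commits to a container, so a list, a Seq, or a disk-backed structure can all be used unchanged.

    import Data.Foldable (foldl')
    import qualified Data.Sequence as Seq

    -- Illustrative sketch: a mean estimator that only folds its input.
    data MeanModel = MeanModel { count :: !Int, total :: !Double }
        deriving (Show)

    trainMean :: Foldable container => container Double -> MeanModel
    trainMean = foldl' step (MeanModel 0 0)
      where
        step (MeanModel k s) x = MeanModel (k + 1) (s + x)

    -- The caller picks whichever container is convenient:
    meanFromList :: MeanModel
    meanFromList = trainMean [1.0, 2.0, 3.0]

    meanFromSeq :: MeanModel
    meanFromSeq = trainMean (Seq.fromList [1.0, 2.0, 3.0])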
The remainder of this paper focuses on HLearn's internal mechanics. Section 2 describes the HomTrainer type class and why it accurately captures our notion of what it means to be a learning model. The HomTrainer class gives us a simple method for defining new models and useful bounds on their training time. Section 3 looks at other algorithms for manipulating algebraic models. These let us automatically parallelize our training procedures, perform asymptotically faster cross-validation, and apply weightings to our data points. Section 4 concludes with the future of the HLearn library. Finally, the haddock documentation³ contains tutorials and further details on practical usage of the library.

³ http://hackage.haskell.org/package/HLearn-algebra

2 The HomTrainer type class

Every learning model is represented by a data type, and that type must be an instance of HomTrainer. Table 3 lists all current instances. The wide selection of models (statistical distributions, classifiers, unsupervised learners, Markov chains, and even NP-approximation algorithms) demonstrates the versatility of the HomTrainer class. In this section, we show why the class is also powerful.

For each of these models, the HomTrainer class associates a unique Datapoint type and provides four training functions (see Code Snippet 1). The two most important training functions are the batch trainer train and the online trainer add1dp. The batch trainer takes a collection of data points and returns the corresponding trained model; typically, we would use it when analyzing historical data generated by some previous process. In contrast, the online trainer is used for analyzing data as it is generated: it takes an already trained model and "adds" a data point to the model. Which is more useful depends on our particular application.

There is a lot of interest in the machine learning community in the relationship between online and batch training. In general, online training is much harder [16, 4, 13, 6]. For our purposes, this means that not every learning model has a known online trainer satisfying the laws of the HomTrainer type class (discussed below). By limiting ourselves in this way, we gain a simpler, more powerful, easier-to-use interface. This tradeoff is reasonable because many popular learning algorithms have variants that do satisfy the HomTrainer laws.

The HomTrainer class also includes two other functions. First, the singleton trainer train1dp is included because it often makes defining new models easier. As Section 2.1 shows, we only need to implement one of the training functions and the rest can be derived automatically; the singleton trainer can often be implemented in a single line of code. Second, the online batch trainer addBatch is included for efficiency. If we have a large list of data points to add to our model, it is more efficient to add them all at once than to add them one by one.

Code Snippet 1. The HomTrainer type class

    class Monoid model => HomTrainer model where
        type Datapoint model

        -- The singleton trainer
        train1dp :: Datapoint model -> model

        -- The batch trainer
        train :: (Functor container, Foldable container)
              => container (Datapoint model) -> model

        -- The online trainer
        add1dp :: model -> Datapoint model -> model

        -- The online batch trainer
        addBatch :: (Functor container, Foldable container)
                 => model -> container (Datapoint model) -> model

Every instance of HomTrainer must obey four laws. First, the batch trainer must be a monoid homomorphism; that is, writing <> for the monoid operation and ++ for concatenating collections of data points:

    train (xs ++ ys) = train xs <> train ys

The next three laws ensure that no matter how we train our model, as long as we use the same data points we will get the same model:

    add1dp (train xs) x    = train (xs ++ [x])
    train1dp x             = train [x]
    addBatch (train xs) ys = train (xs ++ ys)

Next, we discuss how to define new HomTrainer instances and the complexity of the resulting training functions.
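To preview how little an instance has to supply, the following self-contained sketch equips a simplified HomTrainer class with default methods and a toy model. This is our own simplification, not the library's source: we drop the Functor constraint so that foldMap suffices, and the Count model exists only for the demonstration; Section 2.1 describes the library's actual derivation.

    {-# LANGUAGE TypeFamilies #-}

    -- Sketch only: a simplified HomTrainer whose instances need nothing
    -- beyond a Monoid instance and the singleton trainer.
    class Monoid model => HomTrainer model where
        type Datapoint model

        train1dp :: Datapoint model -> model

        train :: Foldable container => container (Datapoint model) -> model
        train = foldMap train1dp  -- model each point, then merge them all

        add1dp :: model -> Datapoint model -> model
        add1dp m x = m <> train1dp x

        addBatch :: Foldable container
                 => model -> container (Datapoint model) -> model
        addBatch m xs = m <> train xs

    -- A toy model that merely counts its data points.
    newtype Count = Count Int deriving (Show, Eq)

    instance Semigroup Count where
        Count a <> Count b = Count (a + b)

    instance Monoid Count where
        mempty = Count 0

    instance HomTrainer Count where
        type Datapoint Count = Char  -- an arbitrary choice for the demo
        train1dp _ = Count 1

    -- With these defaults the laws hold for list inputs, e.g.:
    --   ghci> (train "abc" :: Count) == add1dp (train "ab" :: Count) 'c'
    --   True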